{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Book Title Search Using Towhee, Milvus and OpenAI\n",
    "In this notebook we go over how to search for the best matching book titles using [Towhee](https://github.com/towhee-io/towhee) as the data processing pipeline, [Milvus](https://github.com/milvus-io/milvus) as the Vector Database and [OpenAI](https://beta.openai.com/docs/guides/embeddings) as the embedding system."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Packages\n",
    "We first begin with importing the required packages. In this example, the only non-builtin packages are towhee and pymilvus, with each being the client pacakges for their respective services. These packages can be installed using `pip install pymilvus towhee`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "import random\n",
    "import time\n",
    "from towhee.dc2 import pipe, ops, DataCollection\n",
    "from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parameters\n",
    "Here we can find the main parameters that need to be modified for running with your own accounts. Beside each is a description of what it is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FILE = './data/books.csv'  # https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks\n",
    "COLLECTION_NAME = 'title_db'  # Collection name\n",
    "DIMENSION = 1536  # Embeddings size\n",
    "COUNT = 100  # How many titles to embed and insert\n",
    "HOST = 'localhost'  # Milvus ip address\n",
    "PORT = 19530  # Milvus port\n",
    "OPENAI_KEY = 'your key here' # OpenAI api key\n",
    "OPENAI_ENGINE = 'text-embedding-ada-002' # Which engine to use"
   ]
  },
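  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hard-coding the key is convenient for a quick run, but one alternative is to read it from an environment variable. The cell below is a minimal sketch, not part of the original pipeline: the helper `load_openai_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "\n",
    "# Hypothetical helper: prefer the environment variable, fall back to the given default.\n",
    "def load_openai_key(default, env_var='OPENAI_API_KEY'):\n",
    "    return os.environ.get(env_var, default)\n",
    "\n",
    "\n",
    "# Usage: keeps the value from the parameters cell unless the variable is set.\n",
    "# OPENAI_KEY = load_openai_key(OPENAI_KEY)"
   ]
  },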
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Milvus\n",
    "This segment deals with Milvus and setting up the database for this use case. Within Milvus we need to setup a collection and index the collection. For more information on how to install and run Milvus, look [here](https://milvus.io/docs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Connect to Milvus Database\n",
    "connections.connect(host=HOST, port=PORT)\n",
    "\n",
    "# Remove collection if it already exists\n",
    "if utility.has_collection(COLLECTION_NAME):\n",
    "    utility.drop_collection(COLLECTION_NAME)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create collection which includes the id, title, and embedding.\n",
    "fields = [\n",
    "    FieldSchema(name='id', dtype=DataType.INT64, descrition='Ids', is_primary=True, auto_id=False),\n",
    "    FieldSchema(name='title', dtype=DataType.VARCHAR, description='Title texts', max_length=200),\n",
    "    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=DIMENSION)\n",
    "]\n",
    "schema = CollectionSchema(fields=fields, description='Title collection')\n",
    "collection = Collection(name=COLLECTION_NAME, schema=schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an IVF_FLAT index for collection.\n",
    "index_params = {\n",
    "    'metric_type':'L2',\n",
    "    'index_type':\"IVF_FLAT\",\n",
    "    'params':{\"nlist\":1536}\n",
    "}\n",
    "collection.create_index(field_name=\"embedding\", index_params=index_params)"
   ]
  },
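  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`nlist` sets how many clusters IVF_FLAT partitions the vectors into. As a rough sketch for picking a value, the cell below applies the `4 * sqrt(n)` rule of thumb that is often suggested for IVF indexes. The helper `suggested_nlist` is hypothetical, and the result is only a starting point to tune, not a required setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "# Hypothetical helper: heuristic starting point for nlist, never below 1.\n",
    "def suggested_nlist(n_vectors):\n",
    "    return max(1, int(4 * math.sqrt(n_vectors)))\n",
    "\n",
    "\n",
    "print(suggested_nlist(100))  # 100 matches the default COUNT above"
   ]
  },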
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Insert Data\n",
    "Once we have the collection setup we need to start inserting our data. This is done by creating a pipeline using Towhee. Within this pipeline there are two steps, embedding the text that is inputted, and inserting that data into Milvus. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the book titles\n",
    "def csv_load(file):\n",
    "    with open(file, newline='') as f:\n",
    "        reader = csv.reader(f, delimiter=',')\n",
    "        for row in reader:\n",
    "            yield row[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pipeline which embeds data and inserts into Milvus\n",
    "insert_p = (\n",
    "    pipe.input('id', 'text')\n",
    "    .map(\n",
    "        'text',  # Input columns\n",
    "        'vec',   # Output columns\n",
    "        ops.text_embedding.openai(\n",
    "            engine=OPENAI_ENGINE,\n",
    "            api_key=OPENAI_KEY\n",
    "        )\n",
    "    )\n",
    "    .map(\n",
    "        ('id', 'text', 'vec'),\n",
    "        (), \n",
    "        ops.ann_insert.milvus_client(\n",
    "            host=HOST, \n",
    "            port=PORT, \n",
    "            collection_name=COLLECTION_NAME\n",
    "        )\n",
    "    )\n",
    "                                    \n",
    "    .output()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input the book titles\n",
    "for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k = COUNT)): # Load COUNT amount of random values from dataset\n",
    "    insert_p(idx, text)\n",
    "    time.sleep(3)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search the Data\n",
    "With the collection setup and all our embedded data inserted, we can begin searching the data. This is done by creating a pipeline similar to the inserting pipeline, but in this one instead of inserting the data, we search the data. The search phrase embedded and its vector is searched across all the stored vectors to find the closest matches. These matches are the most semantically similar titles."
   ]
  },
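  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of a closest match concrete, the cell below is a small brute-force sketch of ranking by L2 (Euclidean) distance, the metric chosen for the index. The function `l2_top_k`, the titles, and the 3-dimensional vectors are made up for illustration; real embeddings here have 1536 dimensions, and Milvus uses the IVF_FLAT index instead of scanning every vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "# Hypothetical helper: rank stored vectors by L2 distance to the query (smaller is closer).\n",
    "def l2_top_k(query, stored, k=2):\n",
    "    scored = [(math.dist(query, vec), title) for title, vec in stored.items()]\n",
    "    return sorted(scored)[:k]\n",
    "\n",
    "\n",
    "# Made-up toy vectors standing in for real title embeddings.\n",
    "toy_vectors = {\n",
    "    'Title A': [1.0, 0.0, 0.0],\n",
    "    'Title B': [0.0, 1.0, 0.0],\n",
    "    'Title C': [0.9, 0.1, 0.0],\n",
    "}\n",
    "\n",
    "# Prints (distance, title) pairs for the two nearest toy vectors.\n",
    "print(l2_top_k([1.0, 0.0, 0.0], toy_vectors))"
   ]
  },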
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pipeline to search through titles.\n",
    "search_p = (\n",
    "    pipe.input('text')\n",
    "    .map(\n",
    "        'text',\n",
    "        'vec',\n",
    "        ops.text_embedding.openai(\n",
    "            engine='text-embedding-ada-002',\n",
    "            api_key=OPENAI_KEY\n",
    "        )\n",
    "    )\n",
    "    .flat_map(\n",
    "        'vec',\n",
    "        ('id', 'score', 'text'),\n",
    "        ops.ann_search.milvus_client(\n",
    "            host=HOST,\n",
    "            port=PORT,\n",
    "            collection_name=COLLECTION_NAME, output_fields=['title']\n",
    "        )\n",
    "    )\n",
    "    .output('id', 'score', 'text')\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "collection.load()  # Current operator needs a load\n",
    "dc = DataCollection(search_p('self-improvement'))\n",
    "dc.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15 (main, Nov 24 2022, 08:29:02) \n[Clang 14.0.6 ]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "20e2a1b77e7395ec3f747af99b3084257dd9a83ab453e4f5fc77b9434eecfeb0"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
